Abstract


The matrix completion problem consists in reconstructing a matrix from a sample of entries, possibly
observed with noise. A popular class of estimators, known as nuclear norm penalized estimators,
is based on minimizing the sum of a data fitting term and a nuclear norm penalization. Here, we
investigate the case where the noise distribution belongs to the exponential family and is
sub-exponential. Our framework allows for a general sampling scheme. We first consider an estimator
defined as the minimizer of the sum of a log-likelihood term and a nuclear norm penalization and
prove an upper bound on the Frobenius prediction risk. The rate obtained improves on previous
works on matrix completion for exponential family. When the sampling distribution is known, we
propose another estimator and prove an oracle inequality w.r.t. the Kullback-Leibler prediction risk,
which translates immediately into an upper bound on the Frobenius prediction risk. Finally, we show
that all the rates obtained are minimax optimal up to a logarithmic factor.

Keywords: Low rank matrix estimation; matrix completion; exponential family model; nuclear 
norm 

1. Introduction 

In the matrix completion problem one aims at recovering a matrix, based on partial and noisy
observations of its entries. This problem arises in a wide range of practical situations such as
collaborative filtering or quantum tomography (see Srebro and Salakhutdinov (2010) or Gross (2011)
for instance). In typical applications, the number of observations is usually much smaller than the
total number of entries, so that some structural constraints are needed to recover the whole matrix
efficiently.

More precisely, we consider an $m_1 \times m_2$ real matrix $\bar X$ and observe $n$ samples of the form
$(Y_i, \omega_i)_{i=1}^n$, with $(\omega_i)_{i=1}^n \in ([m_1] \times [m_2])^n$ an i.i.d. sequence of indexes and
$(Y_i)_{i=1}^n \in \mathbb{R}^n$ a sequence of observations which is assumed to be i.i.d. conditionally to the
entries $(\bar X_{\omega_i})_{i=1}^n$. To
recover the unknown parameter matrix $\bar X$, a popular class of methods, known as penalized nuclear
norm estimators, is based on minimizing the sum of a data fitting term and a nuclear norm
penalization term. These estimators have been extensively studied over the past decade and strong
statistical guarantees can be proved in some particular settings. When the conditional distribution
$Y_i | \bar X_{\omega_i}$ is additive and sub-exponential, it can be shown that the unknown matrix can be recovered
efficiently, provided that it is low rank or approximately low rank, see Candes and Plan (2010);
Keshavan et al. (2010); Koltchinskii et al. (2011); Negahban and Wainwright (2012); Cai and Zhou
(2013a); Klopp (2014). In that case, the prediction error satisfies with high probability

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} = O\left(\frac{(m_1 + m_2)\,\mathrm{rk}(\bar X)\log(m_1 + m_2)}{n}\right) , \qquad (1)$$

with $\hat X$ denoting the estimator, $\|\cdot\|_{\sigma,2}$ the Frobenius norm and $\mathrm{rk}(\cdot)$ the rank of a matrix. It has been
proved by Koltchinskii et al. (2011) that this rate is actually minimax optimal up to a logarithmic 
factor. 

Although very common in practice, discrete distributions have received less attention. The 
analysis of a logistic noise was first addressed by Davenport et al. (2012). It was later considered 
by Cai and Zhou (2013b), Lafond et al. (2014) and Klopp et al. (2014) who have shown that the 
prediction error is also of the order of (1), for log-likelihood estimators, regularized with nuclear 
norm. Gunasekar et al. (2014) have investigated the case of distributions belonging to the exponential
family, which is rich enough to encompass both continuous and discrete distributions (Gaussian,
exponential, Poisson, logistic, etc.). They provide (see their Corollary 1) an upper bound for the
prediction error when the noise is sub-Gaussian and the sampling uniform. However, this bound is of
the form

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} = O\left(\alpha^{*2}\,\frac{(m_1 + m_2)\,\mathrm{rk}(\bar X)\log(m_1 + m_2)}{n}\right) ,$$

where $\alpha^{*2}$ is of the order $m_1 m_2$ (see Remark 7 below for more details). Therefore, the obtained
rate does not match (1), which suggests that there may be some room for improvement.

In the present work, we further investigate the case of exponential family distributions and show 
that under some mild assumptions, the rate (1) holds and is minimax optimal up to a logarithmic 
factor. A matrix completion estimator, defined as the minimizer of the sum of a log-likelihood term 
and a nuclear norm penalization term, is first considered. Provided that the noise is sub-exponential 
and the sampling distribution satisfies some assumptions controlling its deviation from the uniform 
distribution, it is proved that with high probability, the prediction error is upper bounded by the 
same rate as in the Gaussian setting (1). It should be noted that the sub-exponential assumption is
satisfied by all the above-mentioned distributions.

When the additional knowledge of the sampling distribution is available, we consider another
estimator, which is inspired by the one proposed by Koltchinskii et al. (2011) in the additive
sub-exponential noise setting. We adapt their proofs to the exponential family distributions and show
that this estimator satisfies an oracle inequality with respect to the Kullback-Leibler prediction risk.
The proof techniques involved are also closely related to the dual certificate analysis derived by
Zhang and Zhang (2012). With high probability, an upper bound on the prediction error, still of the
same order as in (1), is derived from the oracle inequality. Finally, it is proved that the previous
upper bound order is in fact minimax-optimal up to a logarithmic factor.

The rest of the paper is organized as follows. In Section 2.1, the model is specified and some 
background on exponential family distributions is provided. Then we give an upper bound for
the log-likelihood matrix completion estimator in Section 2.2 and an oracle inequality (also yielding
an upper bound) for the estimator with known sampling scheme in Section 2.3. Finally, the lower
bound is provided in Section 2.4. The proofs of the main results are gathered in Section 3 and the 
most technical Lemmas and proofs are deferred to the Appendix. 

Notation 

Throughout the paper, the following notation will be used. For any integers $n, m_1, m_2 > 0$,
$[n] := \{1, \dots, n\}$, $m_1 \vee m_2 := \max(m_1, m_2)$ and $m_1 \wedge m_2 := \min(m_1, m_2)$. We equip the set of
$m_1 \times m_2$ matrices with real entries (denoted by $\mathbb{R}^{m_1 \times m_2}$) with the Hilbert-Schmidt inner product
$\langle X | X' \rangle := \mathrm{tr}(X^\top X')$. For a given matrix $X \in \mathbb{R}^{m_1 \times m_2}$ we write $\|X\|_\infty := \max_{i,j} |X_{i,j}|$ and for
any $s \geq 1$, we denote its Schatten $s$-norm (see Bhatia (1997)) by

$$\|X\|_{\sigma,s} := \left(\sum_{i=1}^{m_1 \wedge m_2} \sigma_i^s(X)\right)^{1/s} ,$$

with $\sigma_i(X)$ the singular values of $X$, ordered in decreasing order. We use the convention
$\|X\|_{\sigma,\infty} = \sigma_1(X)$. For any vector $z := (z_i)_{i=1}^n$, $\mathrm{diag}(z)$ denotes the diagonal matrix whose diagonal
entries are $z_1, \dots, z_n$. For any convex differentiable function $G : \mathbb{R} \to \mathbb{R}$ and $x, x' \in \mathbb{R}$, the
Bregman divergence of $G$ is denoted by

$$d_G(x, x') := G(x) - G(x') - G'(x')(x - x') . \qquad (2)$$
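
As an illustration of definition (2), the following minimal Python sketch evaluates the Bregman divergence for two choices of G. The function names `bregman` and `poisson_kl` are hypothetical and introduced here only for illustration. The sketch checks numerically that, for G(x) = exp(x), the divergence coincides with the Kullback-Leibler divergence between two Poisson distributions whose natural parameters are the logarithms of their intensities.

```python
import math

def bregman(G, G_prime, x, x_prime):
    """Bregman divergence d_G(x, x') = G(x) - G(x') - G'(x') * (x - x')."""
    return G(x) - G(x_prime) - G_prime(x_prime) * (x - x_prime)

def poisson_kl(lam_p, lam_q):
    """Closed-form KL(Poisson(lam_p) || Poisson(lam_q))."""
    return lam_p * math.log(lam_p / lam_q) - lam_p + lam_q

# Quadratic case G(x) = x^2 / 2: the divergence is half the squared distance.
half_square = bregman(lambda x: 0.5 * x * x, lambda x: x, 3.0, 1.0)

# Poisson case G(x) = exp(x), with natural parameter x = log(intensity).
x, x_prime = 0.7, -0.2
d_poisson = bregman(math.exp, math.exp, x, x_prime)
# d_G(x, x') is the KL divergence from the model with parameter x'
# to the model with parameter x.
kl_poisson = poisson_kl(math.exp(x_prime), math.exp(x))
```

In this sketch the order of the arguments matters: the divergence is not symmetric in general, and the quadratic case is the only one of the two for which it is.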



2. Main results 

2.1. Model Specification 

We consider an unknown parameter matrix $\bar X \in \mathbb{R}^{m_1 \times m_2}$ that we aim at recovering. Assume that
an i.i.d. sequence of indexes $(\omega_i)_{i=1}^n \in ([m_1] \times [m_2])^n$ is sampled and denote by $\Pi$ its distribution.
The observations associated to this sequence are denoted by $(Y_i)_{i=1}^n$ and assumed to follow a natural
exponential family distribution, conditionally to the $\bar X$ entries, that is:

$$Y_i | \bar X_{\omega_i} \sim \mathrm{Exp}_{h,G}(\bar X_{\omega_i}) := h(Y_i) \exp\left(\bar X_{\omega_i} Y_i - G(\bar X_{\omega_i})\right) , \qquad (3)$$

where $h$ and $G$ are the base measure and log partition functions associated to the canonical
representation. For ease of notation we often write $\bar X_i$ instead of $\bar X_{\omega_i}$.

Given two matrices $X^1, X^2 \in \mathbb{R}^{m_1 \times m_2}$, we define the empirical and integrated Bregman
divergences as follows

$$D_G^n(X^1, X^2) = \frac{1}{n} \sum_{i=1}^n d_G(X_i^1, X_i^2) \quad \text{and} \quad D_G^\Pi(X^1, X^2) = \mathbb{E}\left[D_G^n(X^1, X^2)\right] . \qquad (4)$$

Note that for exponential family distributions, the Bregman divergence $d_G(\cdot,\cdot)$ corresponds to the
Kullback-Leibler divergence. Let $P_{X^1}$ (resp. $P_{X^2}$) denote the distribution of $(Y_1, \omega_1)$ associated
to the parameters $X^1$ (resp. $X^2$); then $D_G^n(X^1, X^2)$ is the Kullback-Leibler divergence between
$P_{X^1}$ and $P_{X^2}$ conditionally to the sampling, whereas $D_G^\Pi(X^1, X^2)$ is the usual Kullback-Leibler
divergence.

As recalled in the introduction, the exponential family encompasses a wide range of distributions,
either discrete or continuous. Some information on the most commonly used ones is recalled below.


Remark 1 If $G$ is smooth enough, differentiating the density shows that its successive
derivatives can be used to determine the distribution moments. Thus, when $G$ is twice differentiable,
$\mathbb{E}[Y_i | \bar X_i] = G'(\bar X_i)$ and $\mathrm{Var}[Y_i | \bar X_i] = G''(\bar X_i)$ hold.
Distribution                                          Parameter x          G(x)

Gaussian: $N(\mu, \sigma^2)$ ($\sigma$ known)         $\mu/\sigma$         $\sigma^2 x^2 / 2$
Binomial: $B_N(p)$ ($N$ known)                        $\log(p/(1-p))$      $N \log(1 + e^x)$
Poisson: $P(\lambda)$                                 $\log(\lambda)$      $e^x$
Exponential: $E(\lambda)$                             $-\lambda$           $-\log(-x)$


Table 1: Parametrization of some exponential family distributions 
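
Remark 1 can be checked numerically on the log partition functions of Table 1. The following Python sketch is only an illustration: the function names and the finite-difference helpers are assumptions introduced here, and the derivatives are approximated numerically rather than computed in closed form. It compares G' and G'' with the known mean and variance of the Poisson, Binomial and Exponential distributions.

```python
import math

# Log partition functions from Table 1 (x is the natural parameter).
def G_gaussian(x, sigma=1.0):
    return sigma ** 2 * x ** 2 / 2

def G_binomial(x, N=1):
    return N * math.log1p(math.exp(x))

def G_poisson(x):
    return math.exp(x)

def G_exponential(x):
    return -math.log(-x)  # only defined for x < 0

def derivative(G, x, h=1e-5):
    """Central finite-difference approximation of G'(x)."""
    return (G(x + h) - G(x - h)) / (2 * h)

def second_derivative(G, x, h=1e-4):
    """Central finite-difference approximation of G''(x)."""
    return (G(x + h) - 2 * G(x) + G(x - h)) / h ** 2

# Poisson with intensity 3: mean and variance are both 3.
x_poisson = math.log(3.0)
poisson_mean = derivative(G_poisson, x_poisson)
poisson_var = second_derivative(G_poisson, x_poisson)
```

The same two helpers apply to any twice differentiable G, which is the only regularity that Remark 1 requires.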


2.2. General Matrix Completion 

In this section, we provide statistical guarantees on the prediction error of a matrix completion
estimator, which is defined as the minimizer of the sum of a log-likelihood term and a nuclear norm
penalization term. For any $X \in \mathbb{R}^{m_1 \times m_2}$ denote by $\Phi_Y(X)$ the (normalized) conditional negative
log-likelihood of the observations:

$$\Phi_Y(X) = -\frac{1}{n} \sum_{i=1}^n \left(\log(h(Y_i)) + X_i Y_i - G(X_i)\right) . \qquad (5)$$

For $\gamma > 0$ and $\lambda > 0$, the nuclear norm penalized estimator $\hat X$ is defined as follows:

$$\hat X = \mathop{\mathrm{argmin}}_{X \in \mathbb{R}^{m_1 \times m_2},\ \|X\|_\infty \leq \gamma} \Phi_Y^\lambda(X) , \quad \text{where} \quad \Phi_Y^\lambda(X) = \Phi_Y(X) + \lambda \|X\|_{\sigma,1} . \qquad (6)$$

The parameter $\lambda$ controls the trade-off between fitting the data and privileging a low rank solution:
for large values of $\lambda$, the rank of $\hat X$ is expected to be small.
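
The link between the penalization level and the rank can be illustrated with singular value soft-thresholding, which is the proximal operator of the nuclear norm. The sketch below is an assumption-laden illustration, not the procedure of the paper: the text does not prescribe an optimization algorithm, the function names are hypothetical, the Poisson model is picked as an example, and the sup-norm constraint of (6) is ignored.

```python
import numpy as np

def poisson_neg_loglik(X, rows, cols, y):
    """Normalized negative log-likelihood (5) for G(x) = exp(x),
    dropping the log h(Y_i) term, which does not depend on X."""
    x = X[rows, cols]
    return -np.mean(x * y - np.exp(x))

def svt(X, tau):
    """Singular value soft-thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(X, rows, cols, y, lam):
    """Penalized criterion: data fitting term plus lam times the nuclear norm."""
    return poisson_neg_loglik(X, rows, cols, y) + lam * np.linalg.norm(X, "nuc")

# Thresholding a matrix with singular values 3 and 1 at level 2 leaves rank one.
A = np.diag([3.0, 1.0])
A_thresholded = svt(A, 2.0)
```

A proximal gradient scheme would alternate a gradient step on the data fitting term with a call to `svt`, with a threshold proportional to $\lambda$; larger thresholds zero out more singular values, which is the mechanism behind the low rank of the solution.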

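Before giving an upper bound on the prediction risk $\|\hat X - \bar X\|_{\sigma,2}^2$, the following assumptions on
the noise and sampling distributions need to be introduced.

H1 The function $x \mapsto G(x)$ is twice differentiable and strongly convex on $[-\gamma, \gamma]$, so that there
exist constants $\underline{\sigma}_\gamma, \bar\sigma_\gamma > 0$ satisfying:

$$\underline{\sigma}_\gamma^2 \leq G''(x) \leq \bar\sigma_\gamma^2 , \qquad (7)$$

for any $x \in [-\gamma, \gamma]$.

Remark 2 Under H1, for any $x, x' \in [-\gamma, \gamma]$, the Bregman divergence satisfies
$\underline{\sigma}_\gamma^2 (x - x')^2 \leq 2 d_G(x, x') \leq \bar\sigma_\gamma^2 (x - x')^2$.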

Remark 3 If the observations follow a Gaussian distribution, the two convexity constants are equal
to the standard deviation i.e., $\underline{\sigma}_\gamma = \bar\sigma_\gamma = \sigma$ (see Table 1).

For the sampling distribution, one needs to ensure that each entry has a sampling probability
which is lower bounded by a strictly positive constant, that is:

H2 There exists a constant $\mu \geq 1$ such that, for all $m_1, m_2$,

$$\min_{k \in [m_1],\, l \in [m_2]} \pi_{k,l} \geq 1/(\mu m_1 m_2) , \quad \text{where} \quad \pi_{k,l} := \mathbb{P}(\omega_1 = (k, l)) . \qquad (8)$$
Denote by $R_k = \sum_{l=1}^{m_2} \pi_{k,l}$ (resp. $C_l = \sum_{k=1}^{m_1} \pi_{k,l}$) the probability of sampling a coefficient from
row $k$ (resp. column $l$). The following assumption requires that no row or column should be
sampled far more frequently than the others.


H3 There exists a constant $\nu \geq 1$ such that, for all $m_1, m_2$,

$$\max_{k,l}(R_k, C_l) \leq \frac{\nu}{m_1 \wedge m_2} .$$


Remark 4 In the classical case of a uniform sampling, $\mu = \nu = 1$ holds.
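
For a given sampling distribution, the smallest constants compatible with H2 and H3 can be computed directly from the table of probabilities. The following Python sketch is a hypothetical helper (the name `sampling_constants` and the nested-list representation of the probabilities are assumptions made here for illustration); it returns 1 for both constants under uniform sampling, in line with Remark 4.

```python
def sampling_constants(pi):
    """Smallest mu and nu satisfying H2 and H3 for a sampling distribution.

    pi is an m1 x m2 nested list of strictly positive probabilities summing to one.
    """
    m1, m2 = len(pi), len(pi[0])
    m = min(m1, m2)
    # H2: min pi_{k,l} >= 1 / (mu * m1 * m2)
    mu = 1.0 / (m1 * m2 * min(min(row) for row in pi))
    # H3: max(R_k, C_l) <= nu / (m1 ^ m2), with R_k row and C_l column marginals
    row_probs = [sum(row) for row in pi]
    col_probs = [sum(pi[k][l] for k in range(m1)) for l in range(m2)]
    nu = m * max(max(row_probs), max(col_probs))
    return mu, nu

uniform = [[1.0 / 6.0] * 3 for _ in range(2)]
mu_uniform, nu_uniform = sampling_constants(uniform)
```

Distributions that concentrate their mass on a few entries, rows or columns lead to larger values of these constants, which then enter the bounds of Section 2.2 multiplicatively.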


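We define the sequence of matrices $(E_i)_{i=1}^n$ whose entries are all zeros except for the coefficient
$\omega_i$ which is equal to one i.e., $E_i := e_{k_i} (e'_{l_i})^\top$ with $(k_i, l_i) = \omega_i$ and $(e_k)_{k=1}^{m_1}$ (resp. $(e'_l)_{l=1}^{m_2}$)
being the canonical basis of $\mathbb{R}^{m_1}$ (resp. $\mathbb{R}^{m_2}$). Furthermore, for a Rademacher sequence
$(\varepsilon_i)_{i=1}^n$ independent from $(\omega_i, Y_i)_{i=1}^n$, we also define

$$\Sigma_R := \frac{1}{n} \sum_{i=1}^n \varepsilon_i E_i \qquad (9)$$

and use the following notation

$$d = m_1 + m_2 , \quad M = m_1 \vee m_2 , \quad m = m_1 \wedge m_2 . \qquad (10)$$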
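
To get a feeling for the size of the operator norm of the matrix defined in (9), it can be simulated. The Python sketch below is an illustration under assumptions that are not taken from the text: uniform sampling, arbitrary dimensions, a fixed random seed and a hypothetical function name. It averages the operator norm over a few draws and computes the quantity $\sqrt{\log(d)/(mn)}$ as a reference scale.

```python
import numpy as np

def rademacher_sampling_matrix(m1, m2, n, rng):
    """Draws Sigma_R = (1/n) * sum_i eps_i E_i under uniform sampling."""
    rows = rng.integers(0, m1, size=n)       # row index k_i of omega_i
    cols = rng.integers(0, m2, size=n)       # column index l_i of omega_i
    eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher signs
    sigma_r = np.zeros((m1, m2))
    np.add.at(sigma_r, (rows, cols), eps / n)  # accumulates repeated indexes
    return sigma_r

rng = np.random.default_rng(0)
m1, m2, n = 30, 20, 2000
d, m = m1 + m2, min(m1, m2)
norms = [np.linalg.norm(rademacher_sampling_matrix(m1, m2, n, rng), 2)
         for _ in range(20)]
mean_norm = float(np.mean(norms))
reference_scale = float(np.sqrt(np.log(d) / (m * n)))
```

Comparing `mean_norm` with `reference_scale` for several values of n gives an empirical view of how the expected operator norm decreases with the number of observations.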


With these assumptions and notation, we are now ready for stating our main results. 


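Theorem 5 Assume H1, H2, $\|\bar X\|_\infty \leq \gamma$ and $\lambda \geq 2 \|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$. Then with probability at least
$1 - 2d^{-1}$ the following holds:

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} \leq C \mu^2 \max\left( m_1 m_2\, \mathrm{rk}(\bar X) \left(\frac{\lambda^2}{\underline{\sigma}_\gamma^4} + (\mathbb{E}\|\Sigma_R\|_{\sigma,\infty})^2\right) ,\ \frac{\gamma^2}{\mu} \sqrt{\frac{\log(d)}{n}} \right) ,$$

with $\Sigma_R$ and $d$ defined in (9) and (10) and $C$ a numerical constant.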


Proof See Section 3.1. ■ 

In Theorem 5, the term $\mathbb{E}\|\Sigma_R\|_{\sigma,\infty}$ only depends on the sampling distribution and can be upper
bounded using assumption H3. On the other hand, the gradient term $\|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$ depends both on
the sampling and on the observation distributions. In order to control this term with high probability,
the noise is assumed to be sub-exponential.

H4 There exists a constant $\delta_\gamma > 0$ such that for all $x \in [-\gamma, \gamma]$ and $Y \sim \mathrm{Exp}_{h,G}(x)$:

$$\mathbb{E}\left[\exp\left(\frac{|Y - G'(x)|}{\delta_\gamma}\right)\right] \leq e . \qquad (11)$$

Then Theorem 5, H3 and H4 yield together the following result.
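Theorem 6 Assume H1, H2, H3, H4, $\|\bar X\|_\infty \leq \gamma$,

$$n \geq 2 \log(d)\, m\, \nu^{-1} \max\left( \frac{\delta_\gamma^2}{\bar\sigma_\gamma^2} \log^2\left(\delta_\gamma \sqrt{\frac{m}{\underline{\sigma}_\gamma^2}}\right) ,\ 1/9 \right) ,$$

and take $\lambda = 2 c_\gamma \bar\sigma_\gamma \sqrt{2 \nu \log(d) / (mn)}$, where $c_\gamma$ is a constant which depends only on $\delta_\gamma$. Then
with probability at least $1 - 3d^{-1}$ the following holds:

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} \leq \bar C \mu^2 \max\left( \left(c_\gamma^2 \frac{\bar\sigma_\gamma^2}{\underline{\sigma}_\gamma^4} + 1\right) \frac{\nu\, \mathrm{rk}(\bar X) M \log(d)}{n} ,\ \frac{\gamma^2}{\mu} \sqrt{\frac{\log(d)}{n}} \right) ,$$

with $\bar C$ a numerical constant.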


Proof See Section 3.2. 


Remark 7 When $\gamma$ is treated as a constant and $n$ is large, the order of the bound is

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} = O\left(\frac{\mathrm{rk}(\bar X) M \log(d)}{n}\right) ,$$

which matches the rate obtained for Gaussian distributions (1). Matrix completion for exponential
family distributions was considered in the case of uniform sampling (i.e., $\mu = \nu = 1$) and
sub-Gaussian noise by Gunasekar et al. (2014). They provide the following upper bound on the
estimation error

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} = O\left(\alpha^{*2}\, \frac{(m_1 + m_2)\, \mathrm{rk}(\bar X) \log(m_1 + m_2)}{n}\right) ,$$

with $\alpha^*$ satisfying $\alpha^* \geq \sqrt{m_1 m_2}\, \|\bar X\|_\infty$. Therefore, Theorem 6 improves this rate by a factor
$m_1 m_2$.


Remark 8 In the proof, a noncommutative Bernstein inequality for sub-exponential noise is used
to control $\|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$. However, when the observations are uniformly bounded (e.g., logistic
distribution), a uniform Bernstein inequality can be applied instead, leading in some cases to a
sharper bound (see Koltchinskii et al. (2011) and Lafond et al. (2014) for instance).


2.3. Matrix Completion with known sampling scheme 

When the sampling distribution $\Pi$ is known, the following estimator can be defined:

$$\check X := \mathop{\mathrm{argmin}}_{X \in \mathbb{R}^{m_1 \times m_2},\ \|X\|_\infty \leq \gamma} \Phi_Y^\Pi(X) + \lambda \|X\|_{\sigma,1} , \quad \text{with} \quad \Phi_Y^\Pi(X) := G^\Pi(X) - \frac{\sum_{i=1}^n X_i Y_i}{n} \quad \text{and} \quad G^\Pi(X) := \mathbb{E}\left[\frac{\sum_{i=1}^n G(X_i)}{n}\right] . \qquad (12)$$

In the case of sub-exponential additive noise, Koltchinskii et al. (2011) proposed a similar estimator
and have shown that it satisfies an oracle inequality w.r.t. the Frobenius prediction risk. Note that
their estimator coincides with (12) for the particular setting of Gaussian noise. The main interest of
computing $\check X$ instead of $\hat X$, when the sampling distribution is known, lies in the fact that a sharp
oracle inequality can be derived for $\check X$. This powerful tool makes it possible to provide statistical guarantees
on the prediction risk, even if the true parameter $\bar X$ does not belong to the class of estimators
i.e., when $\|\bar X\|_\infty \leq \gamma$ is not satisfied. In this section, it is proved that $\check X$ satisfies an oracle inequality
w.r.t. the integrated Bregman divergence (see Definition (4)), which corresponds to the
Kullback-Leibler divergence for exponential family distributions. An upper bound on the Frobenius prediction
risk is then easily derived from this inequality.


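Theorem 9 Assume H1, H2 and $\lambda \geq \|\nabla \Phi_Y^\Pi(\bar X)\|_{\sigma,\infty}$. Then the following inequalities hold:

$$D_G^\Pi(\check X, \bar X) \leq \inf_{X \in \mathbb{R}^{m_1 \times m_2},\ \|X\|_\infty \leq \gamma} \left( D_G^\Pi(X, \bar X) + 2 \lambda \|X\|_{\sigma,1} \right) \qquad (13)$$

and

$$D_G^\Pi(\check X, \bar X) \leq \inf_{X \in \mathbb{R}^{m_1 \times m_2},\ \|X\|_\infty \leq \gamma} \left( D_G^\Pi(X, \bar X) + \left(\frac{1 + \sqrt{2}}{2}\right)^2 \frac{2 \mu}{\underline{\sigma}_\gamma^2}\, m_1 m_2 \lambda^2\, \mathrm{rk}(X) \right) . \qquad (14)$$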


Proof The proof of Theorem 9 is an adaptation (to exponential family distributions) of the proof by
Koltchinskii et al. (2011), which uses the first order optimality conditions satisfied by $\check X$. Similar
arguments are used by Zhang and Zhang (2012) to provide dual certificates for non smooth convex
optimization problems. The detailed proof is given in Appendix C.1. ■

When $\|\bar X\|_\infty \leq \gamma$, the previous oracle inequalities imply the following upper bound on the
prediction risk.


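Theorem 10 Assume H1, H2, $\lambda \geq \|\nabla \Phi_Y^\Pi(\bar X)\|_{\sigma,\infty}$ and $\|\bar X\|_\infty \leq \gamma$. Then the following holds:

$$\frac{\|\check X - \bar X\|_{\sigma,2}^2}{m_1 m_2} \leq \mu \min\left( \frac{4 \lambda \|\bar X\|_{\sigma,1}}{\underline{\sigma}_\gamma^2} ,\ (1 + \sqrt{2})^2\, \frac{\mu\, m_1 m_2 \lambda^2\, \mathrm{rk}(\bar X)}{\underline{\sigma}_\gamma^4} \right) . \qquad (15)$$


Proof Applying Theorem 9 to $X = \bar X$ and using H2 and H1 yields the result.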


As for the previous estimator, the term $\|\nabla \Phi_Y^\Pi(\bar X)\|_{\sigma,\infty}$ is stochastic and depends both on the
sampling and observations. Assuming that the sampling distribution is uniform and that the noise is
sub-exponential allows one to control it with high probability. Before stating the result, let us define

$$L_\gamma := \sup_{x \in [-\gamma, \gamma]} |G'(x)| . \qquad (16)$$


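Theorem 11 Assume that the sampling is i.i.d. uniform and $\|\bar X\|_\infty \leq \gamma$. Suppose H1, H4, and

$$n \geq 2 \log(d)\, m \max\left( \frac{\delta_\gamma^2}{\bar\sigma_\gamma^2} \log^2\left(\delta_\gamma \sqrt{\frac{m}{\underline{\sigma}_\gamma^2}}\right) ,\ 8/9 \right) .$$

Take $\lambda = (c_\gamma \bar\sigma_\gamma + c^* L_\gamma) \sqrt{2 \log(d) / (mn)}$, where $c_\gamma$ is a constant which depends only on $\delta_\gamma$, $L_\gamma$ is
defined in (16) and $c^*$ is a numerical constant. Then, with probability at least $1 - 2d^{-1}$ the following
holds:

$$\frac{\|\check X - \bar X\|_{\sigma,2}^2}{m_1 m_2} \leq \bar C \left(\frac{c_\gamma \bar\sigma_\gamma + c^* L_\gamma}{\underline{\sigma}_\gamma^2}\right)^2 \frac{\mathrm{rk}(\bar X) M \log(d)}{n} ,$$

with $\bar C$ a numerical constant.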


Remark 12 For simplicity we have considered here only the case of uniform sampling distributions. 
However if we assume that the sampling satisfies 2, 3 and that there exists an absolute constant 
p such that 11^,1 < pj for any mi,m 2 G M, then it is clear from the proof that the same 

bound still holds for a general i.i.d. sampling, up to factors depending on p, v and p. 

Remark 13 If $\gamma$ is treated as a constant, the rate obtained for the Frobenius error is the same as
in Theorem 6. If not, the two rates might differ because the rate of Theorem 11 depends on the
constant $L_\gamma$ which does not appear in Theorem 6. Note in addition that Remark 8 also applies to
Theorem 11.

Proof The proof is similar to the one of Theorem 6, see Appendix C.2. ■


2.4. Lower Bound 

It can be shown that the upper bounds obtained in Theorems 6 and 11 are in fact lower bounds (up to
a logarithmic factor) when $\gamma$ is treated as a constant. Before stating the result, let us first introduce
the set $\mathcal{F}(r, \gamma)$ of matrices of rank at most $r$ whose entries are bounded by $\gamma$:

$$\mathcal{F}(r, \gamma) = \{X \in \mathbb{R}^{m_1 \times m_2} : \mathrm{rank}(X) \leq r,\ \|X\|_\infty \leq \gamma\} .$$

The infimum over all estimators $\hat X$ that are measurable functions of the data $(\omega_i, Y_i)_{i=1}^n$ is denoted
by $\inf_{\hat X}$.


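Theorem 14 There exist two constants $c > 0$ and $\theta > 0$ such that, for all $m_1, m_2 \geq 2$, $1 \leq r \leq
m_1 \wedge m_2$, and $\gamma > 0$,

$$\inf_{\hat X} \sup_{\bar X \in \mathcal{F}(r, \gamma)} \mathbb{P}_{\bar X}\left( \frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} > c \min\left\{ \gamma^2 ,\ \frac{M r}{n \bar\sigma_\gamma^2} \right\} \right) \geq \theta .$$


Remark 15 Theorem 14 provides a lower bound of order $O(M r / (n \bar\sigma_\gamma^2))$. The order of the ratio
between this lower bound and the upper bounds of Theorem 6 is $\left(c_\gamma^2 (\bar\sigma_\gamma / \underline{\sigma}_\gamma)^4 \vee \bar\sigma_\gamma^2\right) \log(d)$. If $\gamma$ is
treated as a constant, lower and upper bounds are therefore the same up to a logarithmic factor.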


Proof See Section 3.3. 
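3. Proofs of main results

For $X \in \mathbb{R}^{m_1 \times m_2}$, denote by $S_1(X) \subset \mathbb{R}^{m_1}$ (resp. $S_2(X) \subset \mathbb{R}^{m_2}$) the linear spans generated by left
(resp. right) singular vectors of $X$. Let $P_{S_1^\perp(X)}$ (resp. $P_{S_2^\perp(X)}$) denote the orthogonal projections
on $S_1^\perp(X)$ (resp. $S_2^\perp(X)$). We then define the following orthogonal projections on $\mathbb{R}^{m_1 \times m_2}$:

$$\mathcal{P}_X^\perp : \tilde X \mapsto P_{S_1^\perp(X)} \tilde X P_{S_2^\perp(X)} \quad \text{and} \quad \mathcal{P}_X : \tilde X \mapsto \tilde X - \mathcal{P}_X^\perp(\tilde X) . \qquad (17)$$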


3.1. Proof of Theorem 5 

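From Definition (6), $\Phi_Y^\lambda(\hat X) \leq \Phi_Y^\lambda(\bar X)$ holds, or equivalently

$$D_G^n(\hat X, \bar X) \leq \lambda \left(\|\bar X\|_{\sigma,1} - \|\hat X\|_{\sigma,1}\right) - \langle \nabla \Phi_Y(\bar X) \,|\, \hat X - \bar X \rangle ,$$

with $D_G^n(\cdot, \cdot)$ defined in (4). The first term of the right hand side can be upper bounded using
Lemma 16-(iii) and the second by duality (between $\|\cdot\|_{\sigma,1}$ and $\|\cdot\|_{\sigma,\infty}$) and the assumption on $\lambda$,
which yields

$$D_G^n(\hat X, \bar X) \leq \lambda \left( \|\mathcal{P}_{\bar X}(\hat X - \bar X)\|_{\sigma,1} + \frac{1}{2} \|\hat X - \bar X\|_{\sigma,1} \right) .$$

Using Lemma 16-(ii) to bound the first term and Lemma 17-(ii) for the second leads to

$$D_G^n(\hat X, \bar X) \leq 3 \lambda \sqrt{2\, \mathrm{rk}(\bar X)}\, \|\hat X - \bar X\|_{\sigma,2} . \qquad (18)$$

On the other hand, by strong convexity of $G$ (H1), we get

$$\Delta_Y^2(\hat X, \bar X) := \frac{1}{n} \sum_{i=1}^n (\hat X_i - \bar X_i)^2 \leq \frac{2}{\underline{\sigma}_\gamma^2} D_G^n(\hat X, \bar X) . \qquad (19)$$

We then define the threshold $\beta := 8 e \gamma^2 \sqrt{\log(d)/n}$ and distinguish the two following cases.

Case 1 If $\sum_{k \in [m_1]} \sum_{l \in [m_2]} \pi_{kl} (\hat X_{kl} - \bar X_{kl})^2 \leq \beta$, then Lemma 18 yields

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} \leq \mu \beta . \qquad (20)$$

Case 2 If $\sum_{k \in [m_1]} \sum_{l \in [m_2]} \pi_{kl} (\hat X_{kl} - \bar X_{kl})^2 > \beta$, then Lemma 17-(ii) and Lemma 18 combined
together give $\hat X \in \mathcal{C}(\beta, 32 \mu m_1 m_2\, \mathrm{rk}(\bar X))$, where $\mathcal{C}(\cdot, \cdot)$ is the set defined as

$$\mathcal{C}(\beta, r) := \left\{ X \in \mathbb{R}^{m_1 \times m_2} \ \middle|\ \|X - \bar X\|_{\sigma,1} \leq \sqrt{r\, \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right]} ;\ \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] > \beta \right\} . \qquad (21)$$

Hence, from Lemma 19 it holds, with probability at least $1 - (d - 1)^{-1} \geq 1 - 2d^{-1}$, that

$$\Delta_Y^2(\hat X, \bar X) \geq \frac{1}{2} \mathbb{E}\left[\Delta_Y^2(\hat X, \bar X)\right] - 512 e\, (\mathbb{E}\|\Sigma_R\|_{\sigma,\infty})^2 \mu m_1 m_2\, \mathrm{rk}(\bar X) . \qquad (22)$$

Combining (22) with (19), (18) and Lemma 18 leads to

$$\frac{\|\hat X - \bar X\|_{\sigma,2}^2}{2 \mu m_1 m_2} - 512 e\, (\mathbb{E}\|\Sigma_R\|_{\sigma,\infty})^2 \mu m_1 m_2\, \mathrm{rk}(\bar X) \leq \frac{6 \lambda}{\underline{\sigma}_\gamma^2} \sqrt{2 m_1 m_2\, \mathrm{rk}(\bar X)}\ \frac{\|\hat X - \bar X\|_{\sigma,2}}{\sqrt{m_1 m_2}} . \qquad (23)$$

Using the inequality $ab \leq a^2 + b^2/4$ in (23) and combining with (20) completes the proof of Theorem 5.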
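Lemma 16 For any pair of matrices $X, \tilde X \in \mathbb{R}^{m_1 \times m_2}$ we have

(i) $\|X + \mathcal{P}_X^\perp(\tilde X)\|_{\sigma,1} = \|X\|_{\sigma,1} + \|\mathcal{P}_X^\perp(\tilde X)\|_{\sigma,1}$ ,

(ii) $\|\mathcal{P}_X(\tilde X)\|_{\sigma,1} \leq \sqrt{2\, \mathrm{rk}(X)}\, \|\tilde X\|_{\sigma,2}$ ,

(iii) $\|X\|_{\sigma,1} - \|\tilde X\|_{\sigma,1} \leq \|\mathcal{P}_X(\tilde X - X)\|_{\sigma,1}$ .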
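
The three properties of Lemma 16 can be checked numerically on random matrices. The Python sketch below is an illustration only: the helper names, the dimensions and the random seed are assumptions introduced here. It builds the projections of definition (17) from a singular value decomposition and exposes the nuclear and Frobenius norms needed to evaluate (i), (ii) and (iii).

```python
import numpy as np

def make_projections(X, tol=1e-10):
    """Returns the maps P_X and P_X^perp of (17), together with rk(X)."""
    U, s, Vt = np.linalg.svd(X)               # full SVD
    r = int(np.sum(s > tol))                  # numerical rank
    P1 = U[:, :r] @ U[:, :r].T                # projector on the left singular space
    P2 = Vt[:r].T @ Vt[:r]                    # projector on the right singular space
    I1, I2 = np.eye(X.shape[0]), np.eye(X.shape[1])

    def P_perp(A):
        return (I1 - P1) @ A @ (I2 - P2)

    def P(A):
        return A - P_perp(A)

    return P, P_perp, r

def nuc(A):
    return np.linalg.norm(A, "nuc")

def fro(A):
    return np.linalg.norm(A, "fro")

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank 2 matrix
X_tilde = rng.standard_normal((5, 4))
P, P_perp, r = make_projections(X)
gap_i = nuc(X + P_perp(X_tilde)) - nuc(X) - nuc(P_perp(X_tilde))
```

Property (i) is an equality, so `gap_i` should vanish up to rounding errors, whereas (ii) and (iii) are inequalities that are generally strict on random inputs.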

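Lemma 17 Let $X, \tilde X \in \mathbb{R}^{m_1 \times m_2}$ satisfying $\|X\|_\infty \leq \gamma$ and $\|\tilde X\|_\infty \leq \gamma$. Assume that $\lambda >
2 \|\nabla \Phi_Y(\tilde X)\|_{\sigma,\infty}$ and $\Phi_Y^\lambda(X) \leq \Phi_Y^\lambda(\tilde X)$. Then

(i) $\|\mathcal{P}_{\tilde X}^\perp(X - \tilde X)\|_{\sigma,1} \leq 3 \|\mathcal{P}_{\tilde X}(X - \tilde X)\|_{\sigma,1}$ ,

(ii) $\|X - \tilde X\|_{\sigma,1} \leq 4 \sqrt{2\, \mathrm{rk}(\tilde X)}\, \|X - \tilde X\|_{\sigma,2}$ .

Lemma 18 Under H2, for any $X \in \mathbb{R}^{m_1 \times m_2}$ it holds

$$\sum_{k \in [m_1]} \sum_{l \in [m_2]} \pi_{kl} (X_{kl} - \bar X_{kl})^2 \geq \frac{1}{\mu m_1 m_2} \|X - \bar X\|_{\sigma,2}^2 .$$


Lemma 19 For $\beta = 8 e \gamma^2 \sqrt{\log(d)/n}$, with probability at least $1 - (d - 1)^{-1}$, we have for all
$X \in \mathcal{C}(\beta, r)$:

$$\left| \Delta_Y^2(X, \bar X) - \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] \right| \leq \frac{\mathbb{E}\left[\Delta_Y^2(X, \bar X)\right]}{2} + 16 e\, (\mathbb{E}\|\Sigma_R\|_{\sigma,\infty})^2\, r ,$$

with $\mathcal{C}(\beta, r)$ defined in (21).

Proof Lemmas 16 and 17 are proved in Appendix A. Lemma 18 follows directly from H2. See
Appendix B for the proof of Lemma 19. ■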


3.2. Proof of Theorem 6 


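Starting from Theorem 5 one only needs to control $\mathbb{E}\|\Sigma_R\|_{\sigma,\infty}$ and $\|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$ to obtain the
result.

Control of $\mathbb{E}\|\Sigma_R\|_{\sigma,\infty}$: One can write $\Sigma_R := n^{-1} \sum_{i=1}^n Z_i$ with $Z_i := \varepsilon_i E_i$ which satisfies
$\mathbb{E}[Z_i] = 0$. Recalling the definitions $R_k = \sum_{l=1}^{m_2} \pi_{k,l}$ and $C_l = \sum_{k=1}^{m_1} \pi_{k,l}$ for any $k \in [m_1]$,
$l \in [m_2]$, one obtains

$$\left\| \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right] \right\|_{\sigma,\infty} \leq \left\| \mathrm{diag}\left((R_k)_{k=1}^{m_1}\right) \right\|_{\sigma,\infty} \leq \frac{\nu}{m} , \qquad (24)$$

where H3 was used for the last inequality. Using a similar argument one also gets
$\left\| \mathbb{E}\left[\sum_{i=1}^n Z_i^\top Z_i\right] \right\|_{\sigma,\infty} / n \leq \nu/m$. Hence applying Lemma 20 with $U = 1$ and $\sigma_Z^2 = \nu/m$, for $n \geq m \log(d)/(9\nu)$ yields

$$\mathbb{E}\left[\|\Sigma_R\|_{\sigma,\infty}\right] \leq c^* \sqrt{\frac{2 e \nu \log(d)}{mn}} , \qquad (25)$$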
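with $c^*$ a numerical constant.

Control of $\|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$: Let us define $Z'_i := (Y_i - G'(\bar X_i)) E_i$, which satisfies $\nabla \Phi_Y(\bar X) =
-n^{-1} \sum_{i=1}^n Z'_i$, $\mathbb{E}[Z'_i] = 0$ (as any score function) and

$$\sigma_{Z'}^2 := \max\left( \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[(Z'_i)^\top Z'_i\right] \right\|_{\sigma,\infty} ,\ \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z'_i (Z'_i)^\top\right] \right\|_{\sigma,\infty} \right) .$$

Using H4, a similar analysis yields $\sigma_{Z'}^2 \leq \bar\sigma_\gamma^2 \nu / m$. On the other hand, $\max_{k,l}(R_k, C_l) \geq 1/m$ and
$\mathbb{E}[(Y_i - G'(\bar X_i))^2] = G''(\bar X_i) \geq \underline{\sigma}_\gamma^2$ gives $\sigma_{Z'}^2 \geq \underline{\sigma}_\gamma^2 / m$. Applying Proposition 21 for $t = \log(d)$
gives with probability at least $1 - d^{-1}$

$$\|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty} \leq c_\gamma \max\left\{ \bar\sigma_\gamma \sqrt{\frac{2 \nu \log(d)}{mn}} ,\ \delta_\gamma \log\left(\delta_\gamma \sqrt{\frac{m}{\underline{\sigma}_\gamma^2}}\right) \frac{2 \log(d)}{n} \right\} , \qquad (26)$$

with $c_\gamma$ which depends only on $\delta_\gamma$. By assumption on $n$, the left term dominates. Therefore taking $\lambda$
as in the statement of Theorem 6 yields $\lambda \geq 2 \|\nabla \Phi_Y(\bar X)\|_{\sigma,\infty}$ with probability at least $1 - d^{-1}$. A union
bound argument combined with Theorem 5 completes the proof of Theorem 6.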


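Lemma 20 Consider a finite sequence of independent random matrices $(Z_i)_{1 \leq i \leq n} \in \mathbb{R}^{m_1 \times m_2}$
satisfying $\mathbb{E}[Z_i] = 0$ and for some $U > 0$, $\|Z_i\|_{\sigma,\infty} \leq U$ for all $i = 1, \dots, n$ and define

$$\sigma_Z^2 := \max\left\{ \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i Z_i^\top\right] \right\|_{\sigma,\infty} ,\ \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i^\top Z_i\right] \right\|_{\sigma,\infty} \right\} .$$

Then, for any $n \geq (U^2 \log(d)) / (9 \sigma_Z^2)$ the following holds:

$$\mathbb{E}\left[ \left\| \frac{1}{n} \sum_{i=1}^n Z_i \right\|_{\sigma,\infty} \right] \leq c^* \sigma_Z \sqrt{\frac{2 e \log(d)}{n}} ,$$

with $c^* = 1 + \sqrt{3}$.


Proof See Klopp et al. (2014)[Lemma 15].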


Proposition 21 Consider a finite sequence of independent random matrices $(Z_i)_{1 \leq i \leq n} \in \mathbb{R}^{m_1 \times m_2}$
satisfying $\mathbb{E}[Z_i] = 0$. For some $U > 0$, assume

$$\inf\{\delta > 0 : \mathbb{E}[\exp(\|Z_i\|_{\sigma,\infty} / \delta)] \leq e\} \leq U \quad \text{for } i = 1, \dots, n ,$$

and define $\sigma_Z$ as in Lemma 20. Then for any $t > 0$, with probability at least $1 - e^{-t}$

$$\left\| \frac{1}{n} \sum_{i=1}^n Z_i \right\|_{\sigma,\infty} \leq c_U \max\left\{ \sigma_Z \sqrt{\frac{t + \log(d)}{n}} ,\ U \log\left(\frac{U}{\sigma_Z}\right) \frac{t + \log(d)}{n} \right\} ,$$

with $c_U$ a constant which depends only on $U$.


Proof This result is an extension of the sub-exponential noncommutative Bernstein inequality
(Koltchinskii, 2013, Theorem 4) to rectangular matrices by dilation, see (Klopp, 2014,
Proposition 11) for details. ■
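3.3. Proof of Theorem 14

We start with a packing set construction, inspired by Koltchinskii et al. (2011). Assume w.l.o.g. that
$m_1 \geq m_2$. Let $\alpha \in (0, 1/8)$ and define $\kappa := \min\left(1/2,\ \sqrt{\alpha m_1 r} / (2 \gamma \bar\sigma_\gamma \sqrt{n})\right)$ and the set of matrices

$$\mathcal{L} = \left\{ L = (l_{ij}) \in \mathbb{R}^{m_1 \times r} : l_{ij} \in \{0, \kappa \gamma\} ,\ \forall i \in [m_1],\ \forall j \in [r] \right\} .$$

Consider the associated set of block matrices

$$\mathcal{L}' = \left\{ L' = (\, L \,|\, \cdots \,|\, L \,|\, O \,) \in \mathbb{R}^{m_1 \times m_2} : L \in \mathcal{L} \right\} ,$$

where $O$ denotes the $m_1 \times (m_2 - r \lfloor m_2 / r \rfloor)$ zero matrix, and $\lfloor x \rfloor$ is the integer part of $x$. The
Varshamov-Gilbert bound (Tsybakov, 2009, Lemma 2.9) guarantees the existence of a subset
$\mathcal{A} \subset \mathcal{L}'$ with cardinality $\mathrm{Card}(\mathcal{A}) \geq 2^{r m_1 / 8} + 1$ containing the null matrix $X^0$ and such that, for
any two distinct elements $X^1$ and $X^2$ of $\mathcal{A}$,

$$\|X^1 - X^2\|_{\sigma,2}^2 \geq \frac{m_1 r}{8}\, \kappa^2 \gamma^2 \left\lfloor \frac{m_2}{r} \right\rfloor \geq \frac{m_1 m_2\, \kappa^2 \gamma^2}{16} . \qquad (27)$$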
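
The block construction used for the packing set can be made concrete with a small example. The Python sketch below uses hypothetical helper names and a nested-list representation chosen here for illustration; it repeats an $m_1 \times r$ matrix as many times as it fits in $m_2$ columns, pads with zero columns, and computes squared Frobenius distances, which are multiplied by the number of repetitions.

```python
def block_matrix(L, m2):
    """Builds L' = (L | ... | L | O) with m2 columns from an m1 x r nested list L."""
    r = len(L[0])
    reps = m2 // r                      # integer part of m2 / r
    padding = [0.0] * (m2 - r * reps)   # zero block O
    return [row * reps + padding for row in L]

def sq_dist(A, B):
    """Squared Frobenius distance between two nested-list matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

c = 0.5  # plays the role of kappa * gamma
L1 = [[0.0, c], [c, 0.0], [c, c]]
L2 = [[0.0, 0.0], [c, 0.0], [0.0, c]]
B1, B2 = block_matrix(L1, 5), block_matrix(L2, 5)
distance_ratio = sq_dist(B1, B2) / sq_dist(L1, L2)
```

This multiplication of distances by the integer part of $m_2 / r$ is what produces the corresponding factor in the first inequality of (27).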


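By construction, any element of $\mathcal{A}$ as well as the difference of any two elements of $\mathcal{A}$ has rank at
most $r$, the entries of any matrix in $\mathcal{A}$ take values in $[0, \gamma]$ and thus $\mathcal{A} \subset \mathcal{F}(r, \gamma)$. For some $X \in \mathcal{A}$,
we now estimate the Kullback-Leibler divergence $D(P_X \| P_{X^0})$ between probability measures $P_{X^0}$
and $P_X$. By independence of the observations $(Y_i, \omega_i)_{i=1}^n$ and since the distribution of $Y_i | \omega_i$ belongs
to the exponential family one obtains

$$D(P_X \| P_{X^0}) = n\, \mathbb{E}_{\omega_1}\left[ G'(X_{\omega_1})(X_{\omega_1} - X^0_{\omega_1}) - G(X_{\omega_1}) + G(X^0_{\omega_1}) \right] .$$

Since $X^0_{\omega_1} = 0$ and either $X_{\omega_1} = 0$ or $X_{\omega_1} = \kappa \gamma$, by strong convexity and by definition of $\kappa$ one
gets

$$D(P_X \| P_{X^0}) \leq \frac{n \bar\sigma_\gamma^2 \kappa^2 \gamma^2}{2} \leq \alpha \log_2(\mathrm{Card}(\mathcal{A}) - 1) ,$$

which implies

$$\frac{1}{\mathrm{Card}(\mathcal{A}) - 1} \sum_{X \in \mathcal{A}} D(P_X \| P_{X^0}) \leq \alpha \log\left(\mathrm{Card}(\mathcal{A}) - 1\right) . \qquad (28)$$

Using (27), (28) and (Tsybakov, 2009, Theorem 2.5) together gives

$$\inf_{\hat X} \sup_{\bar X \in \mathcal{F}(r, \gamma)} \mathbb{P}_{\bar X}\left( \frac{\|\hat X - \bar X\|_{\sigma,2}^2}{m_1 m_2} > c \min\left\{ \gamma^2 ,\ \frac{\alpha M r}{n \bar\sigma_\gamma^2} \right\} \right) \geq \delta(\alpha, M) ,$$

where

$$\delta(\alpha, M) := \frac{1}{1 + 2^{-rM/16}} \left( 1 - 2\alpha - \frac{1}{2} \sqrt{\frac{\alpha}{r M \log(2)}} \right) ,$$

and $c$ is a numerical constant. Since we are free to choose $\alpha$ as small as possible, this completes the
proof.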
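Appendix A. Proof of Lemma 16 and Lemma 17

Lemma 16

Proof If $A, B \in \mathbb{R}^{m_1 \times m_2}$ are two matrices satisfying $S_i(A) \perp S_i(B)$, $i = 1, 2$ (see Definition
(17)), then $\|A + B\|_{\sigma,1} = \|A\|_{\sigma,1} + \|B\|_{\sigma,1}$. Applying this identity with $A = X$ and $B = \mathcal{P}_X^\perp(\tilde X)$,
we obtain

$$\|X + \mathcal{P}_X^\perp(\tilde X)\|_{\sigma,1} = \|X\|_{\sigma,1} + \|\mathcal{P}_X^\perp(\tilde X)\|_{\sigma,1} ,$$

showing (i).

From the definition of $\mathcal{P}_X(\cdot)$, $\mathcal{P}_X(\tilde X) = P_{S_1(X)} \tilde X P_{S_2^\perp(X)} + \tilde X P_{S_2(X)}$ holds and therefore
$\mathrm{rk}(\mathcal{P}_X(\tilde X)) \leq 2\, \mathrm{rk}(X)$. On the other hand, the Cauchy-Schwarz inequality implies that for any
matrix $A$, $\|A\|_{\sigma,1} \leq \sqrt{\mathrm{rk}(A)}\, \|A\|_{\sigma,2}$. Consequently (ii) follows from

$$\|\mathcal{P}_X(\tilde X)\|_{\sigma,1} \leq \sqrt{2\, \mathrm{rk}(X)}\, \|\mathcal{P}_X(\tilde X)\|_{\sigma,2} \leq \sqrt{2\, \mathrm{rk}(X)}\, \|\tilde X\|_{\sigma,2} .$$

Finally, since $\tilde X = X + \mathcal{P}_X^\perp(\tilde X - X) + \mathcal{P}_X(\tilde X - X)$ we have

$$\|\tilde X\|_{\sigma,1} \geq \|X + \mathcal{P}_X^\perp(\tilde X - X)\|_{\sigma,1} - \|\mathcal{P}_X(\tilde X - X)\|_{\sigma,1} = \|X\|_{\sigma,1} + \|\mathcal{P}_X^\perp(\tilde X - X)\|_{\sigma,1} - \|\mathcal{P}_X(\tilde X - X)\|_{\sigma,1} ,$$

leading to (iii). ■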

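Lemma 17

Proof Since $\Phi_Y^\lambda(X) \leq \Phi_Y^\lambda(\tilde X)$, we have

$$\Phi_Y(\tilde X) - \Phi_Y(X) \geq \lambda \left(\|X\|_{\sigma,1} - \|\tilde X\|_{\sigma,1}\right) .$$

For any $X \in \mathbb{R}^{m_1 \times m_2}$, using $X = \tilde X + \mathcal{P}_{\tilde X}^\perp(X - \tilde X) + \mathcal{P}_{\tilde X}(X - \tilde X)$, Lemma 16-(i) and the
triangle inequality, we get

$$\|X\|_{\sigma,1} \geq \|\tilde X\|_{\sigma,1} + \|\mathcal{P}_{\tilde X}^\perp(X - \tilde X)\|_{\sigma,1} - \|\mathcal{P}_{\tilde X}(X - \tilde X)\|_{\sigma,1} ,$$

which implies

$$\Phi_Y(\tilde X) - \Phi_Y(X) \geq \lambda \left( \|\mathcal{P}_{\tilde X}^\perp(X - \tilde X)\|_{\sigma,1} - \|\mathcal{P}_{\tilde X}(X - \tilde X)\|_{\sigma,1} \right) . \qquad (30)$$

Furthermore by convexity of $\Phi_Y$ we have

$$\Phi_Y(\tilde X) - \Phi_Y(X) \leq \langle \nabla \Phi_Y(\tilde X) \,|\, \tilde X - X \rangle ,$$

which yields by duality

$$\Phi_Y(\tilde X) - \Phi_Y(X) \leq \|\nabla \Phi_Y(\tilde X)\|_{\sigma,\infty} \|\tilde X - X\|_{\sigma,1} \leq \frac{\lambda}{2} \|\tilde X - X\|_{\sigma,1}$$

$$\leq \frac{\lambda}{2} \left( \|\mathcal{P}_{\tilde X}^\perp(X - \tilde X)\|_{\sigma,1} + \|\mathcal{P}_{\tilde X}(X - \tilde X)\|_{\sigma,1} \right) , \qquad (31)$$

where we used $\lambda > 2 \|\nabla \Phi_Y(\tilde X)\|_{\sigma,\infty}$ in the second inequality. Then combining (30) with (31) gives (i).
Since $X - \tilde X = \mathcal{P}_{\tilde X}^\perp(X - \tilde X) + \mathcal{P}_{\tilde X}(X - \tilde X)$, using the triangle inequality and (i) yields

$$\|X - \tilde X\|_{\sigma,1} \leq 4 \|\mathcal{P}_{\tilde X}(X - \tilde X)\|_{\sigma,1} . \qquad (32)$$

Combining (32) and Lemma 16-(ii) leads to (ii). ■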


Appendix B. Proof of Lemma 19 

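Proof The proof is adapted from (Negahban and Wainwright, 2012, Theorem 1) and (Klopp, 2014,
Lemma 12). We use a peeling argument combined with a sharp deviation inequality detailed in
Lemma 22. For any $\alpha > 1$, $\beta > 0$ and $0 < \eta < 1/(2\alpha)$, define

$$\epsilon(r, \alpha, \eta) := \frac{4}{1/(2\alpha) - \eta}\, (\mathbb{E}\|\Sigma_R\|_{\sigma,\infty})^2\, r , \qquad (33)$$

and consider the events

$$\mathcal{B} := \left\{ \exists X \in \mathcal{C}(\beta, r) \ \middle|\ \left| \Delta_Y^2(X, \bar X) - \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] \right| > \frac{\mathbb{E}\left[\Delta_Y^2(X, \bar X)\right]}{2} + \epsilon(r, \alpha, \eta) \right\}$$

and

$$\mathcal{R}_l := \left\{ X \in \mathcal{C}(\beta, r) \ \middle|\ \alpha^{l-1} \beta < \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] < \alpha^l \beta \right\} .$$

Let us also define the set

$$\mathcal{C}(\beta, r, t) := \left\{ X \in \mathcal{C}(\beta, r) \ \middle|\ \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] \leq t \right\} ,$$

and

$$Z_t := \sup_{X \in \mathcal{C}(\beta, r, t)} \left| \Delta_Y^2(X, \bar X) - \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] \right| . \qquad (34)$$

Then for any $X \in \mathcal{B} \cap \mathcal{R}_l$ we have

$$\left| \Delta_Y^2(X, \bar X) - \mathbb{E}\left[\Delta_Y^2(X, \bar X)\right] \right| > \frac{1}{2} \alpha^{l-1} \beta + \epsilon(r, \alpha, \eta) .$$

Moreover by definition of $\mathcal{R}_l$, $X \in \mathcal{C}(\beta, r, \alpha^l \beta)$. Therefore

$$\mathcal{B} \cap \mathcal{R}_l \subset \mathcal{B}_l := \left\{ Z_{\alpha^l \beta} > \frac{\alpha^l \beta}{2\alpha} + \epsilon(r, \alpha, \eta) \right\} .$$

If we now apply a union bound argument combined with Lemma 22 we get

$$\mathbb{P}(\mathcal{B}) \leq \sum_{l=1}^{+\infty} \mathbb{P}(\mathcal{B}_l) \leq \sum_{l=1}^{+\infty} \exp\left( -\frac{n \eta^2 (\alpha^l \beta)^2}{8 \gamma^4} \right) \leq \frac{\exp\left(-\frac{n \eta^2 \log(\alpha) \beta^2}{4 \gamma^4}\right)}{1 - \exp\left(-\frac{n \eta^2 \log(\alpha) \beta^2}{4 \gamma^4}\right)} ,$$

where we used $x \leq e^x$ in the second inequality. Choosing $\alpha = e$, $\eta = (4e)^{-1}$ and $\beta$ as stated in the
Lemma yields the result. ■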
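
The final choice of constants can be verified by direct computation. The Python sketch below uses a hypothetical function name introduced for illustration; it evaluates the geometric-series tail bound and the coefficient of (33) for $\alpha = e$, $\eta = (4e)^{-1}$ and $\beta = 8e\gamma^2\sqrt{\log(d)/n}$, so that they can be compared with the probability $(d-1)^{-1}$ and the constant $16e$ appearing in Lemma 19.

```python
import math

def lemma19_constants(n, d, gamma):
    """Evaluates the tail bound and the epsilon coefficient for the stated constants."""
    alpha, eta = math.e, 1.0 / (4.0 * math.e)
    beta = 8.0 * math.e * gamma ** 2 * math.sqrt(math.log(d) / n)
    # Common ratio of the geometric series bounding the union over the peeling sets.
    q = math.exp(-n * eta ** 2 * math.log(alpha) * beta ** 2 / (4.0 * gamma ** 4))
    tail = q / (1.0 - q)
    # Coefficient multiplying (E||Sigma_R||)^2 * r in definition (33).
    eps_coeff = 4.0 / (1.0 / (2.0 * alpha) - eta)
    return tail, eps_coeff

tail, eps_coeff = lemma19_constants(n=500, d=50, gamma=2.0)
```

With these constants the exponent reduces to $\log(d)$ whatever the values of n and $\gamma$, which is why the resulting probability depends on d only.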


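Lemma 22 Let $\alpha > 1$ and $0 < \eta < 1/(2\alpha)$. Then we have

$$\mathbb{P}\left( Z_t > t/(2\alpha) + \epsilon(r, \alpha, \eta) \right) \leq \exp\left( -n \eta^2 t^2 / (8 \gamma^4) \right) , \qquad (35)$$

where $\epsilon(r, \alpha, \eta)$ and $Z_t$ are defined in (33) and (34).

Proof From Massart's inequality (Massart, 2000, Theorem 9) we get for $0 < \eta < 1/(2\alpha)$

$$\mathbb{P}\left( Z_t > \mathbb{E}[Z_t] + \eta t \right) \leq \exp\left( -\eta^2 n t^2 / (8 \gamma^4) \right) . \qquad (36)$$

A symmetrization argument gives

$$\mathbb{E}[Z_t] \leq 2\, \mathbb{E}\left[ \sup_{X \in \mathcal{C}(\beta, r, t)} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (X_i - \bar X_i)^2 \right| \right] ,$$

where $\varepsilon := (\varepsilon_i)_{1 \leq i \leq n}$ is a Rademacher sequence independent from $(Y_i, \omega_i)_{i=1}^n$. The contraction
principle (Ledoux and Talagrand, 1991, Theorem 4.12) yields

$$\mathbb{E}[Z_t] \leq 4\, \mathbb{E}\left[ \sup_{X \in \mathcal{C}(\beta, r, t)} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (X_i - \bar X_i) \right| \right] = 4\, \mathbb{E}\left[ \sup_{X \in \mathcal{C}(\beta, r, t)} \left| \langle \Sigma_R \,|\, X - \bar X \rangle \right| \right] ,$$

where $\Sigma_R$ is defined in (9). Applying the duality inequality and then plugging into (36) gives

$$\mathbb{P}\left( Z_t > 4\, \mathbb{E}[\|\Sigma_R\|_{\sigma,\infty}] \sqrt{rt} + \eta t \right) \leq \exp\left( -\eta^2 n t^2 / (8 \gamma^4) \right) .$$

Since for any $a, b \in \mathbb{R}$ and $c > 0$, $ab \leq (a^2/c + c b^2)/2$, the proof is concluded by noting that

$$4\, \mathbb{E}[\|\Sigma_R\|_{\sigma,\infty}] \sqrt{rt} \leq \frac{4}{1/(2\alpha) - \eta}\, \mathbb{E}[\|\Sigma_R\|_{\sigma,\infty}]^2\, r + \left(1/(2\alpha) - \eta\right) t .$$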




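Appendix C. Proof of Oracle inequalities and Bounds for Completion with known
sampling

C.1. Proof of Theorem 9


Proof The proof is an extension (to the exponential family case) of the one proposed in (Koltchinskii et al.,
2011, Theorem 1). For ease of notation, let us define $H := \nabla \Phi_Y^\Pi(\bar X)$ and the set $\Gamma := \{X \in
\mathbb{R}^{m_1 \times m_2} \,|\, \|X\|_\infty \leq \gamma\}$. In view of Remark 1, one obtains

$$H = \nabla G^\Pi(\bar X) - \frac{\sum_{i=1}^n Y_i E_i}{n} = -\frac{\sum_{i=1}^n \left(Y_i E_i - \mathbb{E}[Y_i E_i]\right)}{n} .$$

From the definition of $\check X$, for any $X \in \Gamma$,

$$G^\Pi(\check X) - \frac{\sum_{i=1}^n \check X_i Y_i}{n} \leq G^\Pi(X) - \frac{\sum_{i=1}^n X_i Y_i}{n} + \lambda \left(\|X\|_{\sigma,1} - \|\check X\|_{\sigma,1}\right) ,$$

or equivalently

$$G^\Pi(\check X) - G^\Pi(\bar X) - \langle \nabla G^\Pi(\bar X) \,|\, \check X - \bar X \rangle$$
$$\leq G^\Pi(X) - G^\Pi(\bar X) - \langle \nabla G^\Pi(\bar X) \,|\, X - \bar X \rangle + \langle H \,|\, X - \check X \rangle + \lambda \left(\|X\|_{\sigma,1} - \|\check X\|_{\sigma,1}\right) .$$

Applying duality and the triangle inequality yields

$$D_G^\Pi(\check X, \bar X) - D_G^\Pi(X, \bar X) \leq \lambda \left( \|X - \check X\|_{\sigma,1} + \|X\|_{\sigma,1} - \|\check X\|_{\sigma,1} \right) \leq 2 \lambda \|X\|_{\sigma,1} ,$$

where we used the assumption $\lambda \geq \|H\|_{\sigma,\infty}$. This proves (13).

For (14), by definition

$$\check X = \mathop{\mathrm{argmin}}_{X \in \mathbb{R}^{m_1 \times m_2}} F(X) := G^\Pi(X) - \frac{\sum_{i=1}^n X_i Y_i}{n} + \lambda \|X\|_{\sigma,1} + \delta_\Gamma(X) ,$$

where $\delta_\Gamma$ is the indicator function of the bounded closed convex set $\Gamma$ i.e., $\delta_\Gamma(x) = 0$ if $x \in \Gamma$
and $\delta_\Gamma(x) = +\infty$ otherwise. Since $F$ is convex, $\check X$ satisfies $0 \in \partial F(\check X)$ with $\partial F$ denoting the
subdifferential of $F$. It is easily checked that the subdifferential $\partial \delta_\Gamma(\check X)$ is the normal cone of $\Gamma$ at
the point $\check X$. Hence, $0 \in \partial F(\check X)$ implies that there exists $\check V \in \partial \|\check X\|_{\sigma,1}$ such that for any $X \in \Gamma$,

$$\langle \nabla G^\Pi(\check X) \,|\, \check X - X \rangle - \left\langle \frac{\sum_{i=1}^n Y_i E_i}{n} \,\middle|\, \check X - X \right\rangle + \lambda \langle \check V \,|\, \check X - X \rangle \leq 0 ,$$

or equivalently

$$\langle \nabla G^\Pi(\check X) - \nabla G^\Pi(\bar X) \,|\, \check X - X \rangle + \lambda \langle \check V \,|\, \check X - X \rangle \leq \langle H \,|\, X - \check X \rangle .$$

For any $x, \check x, \bar x \in \mathbb{R}$, from the Bregman divergence definition it holds

$$(G'(\check x) - G'(\bar x))(\check x - x) = d_G(x, \check x) + d_G(\check x, \bar x) - d_G(x, \bar x) . \qquad (37)$$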
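
Identity (37) is the three-point identity of Bregman divergences. The short verification below is a sketch that only uses definition (2); the labelling of the three real arguments as $x$, $\check x$ and $\bar x$ is an assumption made here for readability.

```latex
\begin{align*}
d_G(x,\check x) + d_G(\check x,\bar x) - d_G(x,\bar x)
&= \bigl[G(x) - G(\check x) - G'(\check x)(x-\check x)\bigr] \\
&\quad + \bigl[G(\check x) - G(\bar x) - G'(\bar x)(\check x-\bar x)\bigr] \\
&\quad - \bigl[G(x) - G(\bar x) - G'(\bar x)(x-\bar x)\bigr] \\
&= -G'(\check x)(x-\check x) + G'(\bar x)\bigl[(x-\bar x) - (\check x-\bar x)\bigr] \\
&= \bigl(G'(\check x) - G'(\bar x)\bigr)(\check x - x).
\end{align*}
```

All the terms involving values of $G$ cancel, so the identity holds for any differentiable $G$ and does not require convexity.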

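In addition, for any $V \in \partial \|X\|_{\sigma,1}$, the subdifferential monotonicity yields $\langle \check V - V \,|\, \check X - X \rangle \geq 0$.
Therefore

$$D_G^\Pi(X, \check X) + D_G^\Pi(\check X, \bar X) - D_G^\Pi(X, \bar X) \leq \langle H \,|\, X - \check X \rangle - \lambda \langle V \,|\, \check X - X \rangle . \qquad (38)$$

In Watson (1992), it is shown that:

$$\partial \|X\|_{\sigma,1} = \left\{ \sum_{i=1}^r u_i v_i^\top + \mathcal{P}_X^\perp W \ \middle|\ \|W\|_{\sigma,\infty} \leq 1 \right\} , \qquad (39)$$

where $r := \mathrm{rk}(X)$, $u_i$ (resp. $v_i$) are the left (resp. right) singular vectors of $X$ and $\mathcal{P}_X^\perp$ is defined
in (17). Denote by $S_1$ (resp. $S_2$) the space of the left (resp. right) singular vectors of $X$. For
$W \in \mathbb{R}^{m_1 \times m_2}$,

$$\left\langle \sum_{i=1}^r u_i v_i^\top + \mathcal{P}_X^\perp W \ \middle|\ \check X - X \right\rangle = \left\langle \sum_{i=1}^r u_i v_i^\top \ \middle|\ P_{S_1}(\check X - X) P_{S_2} \right\rangle + \langle W \,|\, \mathcal{P}_X^\perp(\check X) \rangle ,$$

and $W$ can be chosen such that $\langle W \,|\, \mathcal{P}_X^\perp(\check X) \rangle = \|\mathcal{P}_X^\perp(\check X)\|_{\sigma,1}$ and $\|W\|_{\sigma,\infty} \leq 1$. Taking $V \in
\partial \|X\|_{\sigma,1}$ associated to this choice of $W$ (in the sense of (39)) and $\|\sum_{i=1}^r u_i v_i^\top\|_{\sigma,\infty} = 1$ yield

$$D_G^\Pi(X, \check X) + D_G^\Pi(\check X, \bar X) - D_G^\Pi(X, \bar X) + \lambda \|\mathcal{P}_X^\perp(\check X)\|_{\sigma,1}
\leq \langle H \,|\, X - \check X \rangle + \lambda \|P_{S_1}(\check X - X) P_{S_2}\|_{\sigma,1} . \qquad (40)$$

The first right hand side term can be upper bounded as follows

$$\langle H \,|\, X - \check X \rangle = \langle H \,|\, \mathcal{P}_X(X - \check X) \rangle + \langle H \,|\, \mathcal{P}_X^\perp(X - \check X) \rangle
\leq \|H\|_{\sigma,\infty} \left( \sqrt{2\, \mathrm{rk}(X)}\, \|X - \check X\|_{\sigma,2} + \|\mathcal{P}_X^\perp(\check X)\|_{\sigma,1} \right) , \qquad (41)$$

where duality and Lemma 16-(ii) are used for the inequality. Since $\mathrm{rk}(P_{S_1}(\check X - X) P_{S_2}) \leq \mathrm{rk}(X)$,
the second term satisfies

$$\|P_{S_1}(\check X - X) P_{S_2}\|_{\sigma,1} \leq \sqrt{\mathrm{rk}(X)}\, \|\check X - X\|_{\sigma,2} . \qquad (42)$$

Using $\lambda \geq \|H\|_{\sigma,\infty}$, (40), (41) and (42) gives

$$D_G^\Pi(X, \check X) + D_G^\Pi(\check X, \bar X) + (\lambda - \|H\|_{\sigma,\infty}) \|\mathcal{P}_X^\perp(\check X)\|_{\sigma,1}
\leq D_G^\Pi(X, \bar X) + \lambda (1 + \sqrt{2}) \sqrt{\mathrm{rk}(X)}\, \|\check X - X\|_{\sigma,2} . \qquad (43)$$

By H1 and H2, $\|\check X - X\|_{\sigma,2}^2 \leq 2 \underline{\sigma}_\gamma^{-2} m_1 m_2 \mu\, D_G^\Pi(X, \check X)$, hence

$$D_G^\Pi(\check X, \bar X) + (\lambda - \|H\|_{\sigma,\infty}) \|\mathcal{P}_X^\perp(\check X)\|_{\sigma,1}
\leq D_G^\Pi(X, \bar X) + \left(\frac{1 + \sqrt{2}}{2}\right)^2 2 \underline{\sigma}_\gamma^{-2} m_1 m_2 \mu \lambda^2\, \mathrm{rk}(X) , \qquad (44)$$

proving (14). ■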
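C.2. Proof of Theorem 11

Proof By the triangle inequality,

$$\|H\|_{\sigma,\infty} \leq \left\| \frac{\sum_{i=1}^n (Y_i - G'(\bar X_i)) E_i}{n} \right\|_{\sigma,\infty} + \left\| \frac{\sum_{i=1}^n G'(\bar X_i) E_i}{n} - \mathbb{E}[G'(\bar X_i) E_i] \right\|_{\sigma,\infty} \qquad (45)$$

holds. As seen in the proof of Theorem 6 (in Section 3.2), the first term of the right hand side
satisfies (26) with probability at least $1 - d^{-1}$. If we define $Z_i = G'(\bar X_i) E_i - \mathbb{E}[G'(\bar X_i) E_i]$, then
$\mathbb{E}[Z_i] = 0$ and $\|Z_i\|_{\sigma,\infty} \leq 2 L_\gamma$, with $L_\gamma$ defined in (16). A similar argument to the one used to
derive Equation (24) yields

$$\left\| \mathbb{E}\left[Z_i Z_i^\top\right] \right\|_{\sigma,\infty} \leq \left\| \mathbb{E}\left[ (G'(\bar X_i) E_i)(G'(\bar X_i) E_i)^\top \right] \right\|_{\sigma,\infty} \leq \frac{L_\gamma^2}{m} ,$$

and the same bound holds for $\mathbb{E}[Z_i^\top Z_i]$. Therefore, the uniform version of the noncommutative
Bernstein inequality (Proposition 23) ensures that with probability at least $1 - d^{-1}$

$$\left\| \frac{\sum_{i=1}^n G'(\bar X_i) E_i}{n} - \mathbb{E}[G'(\bar X_i) E_i] \right\|_{\sigma,\infty} \leq c^* \max\left\{ L_\gamma \sqrt{\frac{2 \log(d)}{mn}} ,\ 2 L_\gamma \frac{2 \log(d)}{3n} \right\} . \qquad (46)$$

Combining (26) and (46) with the assumption made on $n$ in Theorem 11 completes the proof.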


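Proposition 23 Consider a finite sequence of independent random matrices $(Z_i)_{1 \leq i \leq n} \in \mathbb{R}^{m_1 \times m_2}$
satisfying $\mathbb{E}[Z_i] = 0$ and for some $U > 0$, $\|Z_i\|_{\sigma,\infty} \leq U$ for all $i = 1, \dots, n$. Then for any $t > 0$

$$\mathbb{P}\left( \left\| \frac{1}{n} \sum_{i=1}^n Z_i \right\|_{\sigma,\infty} > t \right) \leq d \exp\left( -\frac{n t^2 / 2}{\sigma_Z^2 + U t / 3} \right) ,$$

where $d = m_1 + m_2$ and

$$\sigma_Z^2 := \max\left\{ \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i Z_i^\top\right] \right\|_{\sigma,\infty} ,\ \left\| \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i^\top Z_i\right] \right\|_{\sigma,\infty} \right\} .$$

In particular it implies that with probability at least $1 - e^{-t}$

$$\left\| \frac{1}{n} \sum_{i=1}^n Z_i \right\|_{\sigma,\infty} \leq c^* \max\left\{ \sigma_Z \sqrt{\frac{t + \log(d)}{n}} ,\ \frac{U (t + \log(d))}{3n} \right\} ,$$

with $c^* = 1 + \sqrt{3}$.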


Proof The first claim of the proposition is Bernstein's inequality for random matrices (see for
example (Tropp, 2012, Theorem 1.6)). Solving the equation (in $t$) $-\frac{n t^2 / 2}{\sigma_Z^2 + U t / 3} + \log(d) = -v$ gives
with probability at least $1 - e^{-v}$

$$\left\| \frac{1}{n} \sum_{i=1}^n Z_i \right\|_{\sigma,\infty} \leq \frac{1}{n} \left[ \frac{U}{3} (v + \log(d)) + \sqrt{ \frac{U^2}{9} (v + \log(d))^2 + 2 n \sigma_Z^2 (v + \log(d)) } \right] ;$$

we conclude the proof by distinguishing the two cases $n \sigma_Z^2 \leq (U^2/9)(v + \log(d))$ or $n \sigma_Z^2 >
(U^2/9)(v + \log(d))$. ■
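
The last step of this proof can be checked numerically. The Python sketch below, with hypothetical function names introduced for illustration, computes the positive root of the quadratic equation obtained from Bernstein's inequality and the simplified bound with constant $1 + \sqrt{3}$; here `a` stands for $v + \log(d)$ and `sigma2` for $\sigma_Z^2$.

```python
import math

def bernstein_quantile(n, sigma2, U, a):
    """Positive root t of n * t^2 / 2 = a * (sigma2 + U * t / 3), with a = v + log(d)."""
    half_linear = U * a / 3.0
    return (half_linear + math.sqrt(half_linear ** 2 + 2.0 * n * sigma2 * a)) / n

def simplified_bound(n, sigma2, U, a):
    """Bound c* * max(sigma_Z * sqrt(a / n), U * a / (3 n)) with c* = 1 + sqrt(3)."""
    c_star = 1.0 + math.sqrt(3.0)
    return c_star * max(math.sqrt(sigma2 * a / n), U * a / (3.0 * n))

t_example = bernstein_quantile(n=1000, sigma2=0.05, U=2.0, a=math.log(50) + 3.0)
bound_example = simplified_bound(n=1000, sigma2=0.05, U=2.0, a=math.log(50) + 3.0)
```

In each of the two cases distinguished in the proof, one of the two terms under the square root dominates the other, which is how the maximum of a variance term and a range term arises in the simplified bound.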